Made by Pedro Concejero for Coursera, based on previous work in the Madrid R users group.
Code and dataset are available upon request from pedro.concejerocerezo at gmail.com.
This document fulfils the peer assessment homework, version 1, “empirical network analysis”.
It was written in RStudio with knitr markdown and relies mainly on the igraph R library for social network analysis (SNA).
Explanations of how to use igraph to build R graph objects are interleaved with the description of the data and the objectives of the analysis.
# Loading the libraries required for SNA
library(igraph)
# required to produce some plots
library(gplots)
## KernSmooth 2.23 loaded
## Copyright M. P. Wand 1997-2009
##
## Attaching package: 'gplots'
##
## The following object is masked from 'package:stats':
##
## lowess
# If you are doing this analysis and have the dataset (ask me if you want
# it) specify your working directory here
setwd("D:/2013/enron")
# enron.RData file contains the working space with several R objects that
# are explained and described below
load("enron.RData")
load("edges_w_message.RData")
The Enron scandal came to light in 2001 and led to the largest bankruptcy recorded up to that date (larger ones have happened since). An excellent reference on the Enron story can be found on Wikipedia:
http://en.wikipedia.org/wiki/Enron_scandal
After the company’s collapse, a large database of over 600,000 e-mails generated by 158 employees of the Enron Corporation was acquired by the Federal Energy Regulatory Commission during its investigation. A copy of the database was subsequently purchased for $10,000 by Andrew McCallum, a computer scientist at the University of Massachusetts Amherst, who released this copy to researchers as the “Enron corpus”. This analysis is based on this dataset. More about the Enron corpus can be found at:
http://en.wikipedia.org/wiki/Enron_Corpus
More specifically, the dataset analysed here is based on a MySQL implementation of all e-mails between the 158 Enron employees and the rest of the world (except private e-mails, which the database owners had previously deleted). Since that dataset would be rather difficult to use in an educational setting, it was restricted to the e-mails exchanged between Enron employees, which reduces its size considerably and makes the links easier to interpret.
This dataset was created from a version of the enron corpus by Jitesh Shetty and Jafar Adibi available here: http://www.isi.edu/~adibi/Enron/Enron.htm
This is the origin of the two dataframes that are required to produce an igraph R graph object: edges (or links, in this case, e-mails), and nodes.
Edges is a dataframe containing the links. For igraph it is essential that the first two columns of this dataframe are node ids (usually the sender first and the receiver second). The edges dataframe holds the following information:
- sender: e-mail address of the sender
- receiver: e-mail address of the receiver
- type: type of e-mail (CC, BCC, TO)
- subject: string with the subject of the e-mail
- body: full text of the e-mail message
- date
# Number of edges -or e-mails- included in dataset
nrow(edges.full)
## [1] 61673
# Description of the edges object
str(edges.full)
## 'data.frame': 61673 obs. of 6 variables:
## $ sender : chr "mary.hain@enron.com" "mary.hain@enron.com" "mary.hain@enron.com" "cooper.richey@enron.com" ...
## $ receiver: chr "sean.crandall@enron.com" "mike.swerzbin@enron.com" "robert.badeer@enron.com" "robert.badeer@enron.com" ...
## $ type : chr "TO" "TO" "TO" "TO" ...
## $ subject : chr "Enron s transmission/power exchange model for discussion" "Enron s transmission/power exchange model for discussion" "Enron s transmission/power exchange model for discussion" "Change to EnData" ...
## $ body : chr "---------------------- Forwarded by Mary Hain/HOU/ECT on 08/17/2000 02:15 PM ---------------------------James D Steffes@EES08/1"| __truncated__ "---------------------- Forwarded by Mary Hain/HOU/ECT on 08/17/2000 02:15 PM ---------------------------James D Steffes@EES08/1"| __truncated__ "---------------------- Forwarded by Mary Hain/HOU/ECT on 08/17/2000 02:15 PM ---------------------------James D Steffes@EES08/1"| __truncated__ "The Fundamentals Group is moving Database servers and the existing EnData Excel Add-Inneeds to be changed. If you use Endata, "| __truncated__ ...
## $ date : chr "2000-08-17 07:11:00" "2000-08-17 07:11:00" "2000-08-17 07:11:00" "2000-08-23 04:39:00" ...
# Re-formatting date so that we can use dates in R
edges.full$date.R <- as.POSIXct(edges.full$date)
Note that the original date column stays a string, because Gephi does not understand the exported R date format; the parsed copy date.R is only used inside R.
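A minimal sketch of the same date handling on a toy table, assuming timestamps in the `YYYY-MM-DD HH:MM:SS` layout shown above. The data frame `toy.edges`, the example.com addresses and the explicit UTC time zone are assumptions for illustration; the original code relies on the session's default time zone.

```r
# Toy edge list with string timestamps (hypothetical values)
toy.edges <- data.frame(
  sender   = c("a@example.com", "b@example.com"),
  receiver = c("b@example.com", "c@example.com"),
  date     = c("2000-08-17 07:11:00", "2000-08-23 04:39:00"),
  stringsAsFactors = FALSE
)

# Keep the string column for export and add a parsed copy for use in R
toy.edges$date.R <- as.POSIXct(toy.edges$date,
                               format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

# The parsed column supports comparisons and arithmetic
toy.edges$date.R[2] > toy.edges$date.R[1]
difftime(toy.edges$date.R[2], toy.edges$date.R[1], units = "days")
```

Stating the format and time zone explicitly makes the conversion reproducible across machines.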
The other object required to build an igraph graph is the nodes dataframe. It contains all the information about the nodes, in our case the Enron employees who sent or received e-mails.
This dataframe contains the e-mail address as node id, the lastName as a convenient label, and the person's status in the company (if this information was available).
# Number of nodes
nrow(nodes)
## [1] 149
# Description of the nodes object
str(nodes)
## 'data.frame': 149 obs. of 3 variables:
## $ Email_id: chr "marie.heard@enron.com" "mark.e.taylor@enron.com" "lindy.donoho@enron.com" "lisa.gang@enron.com" ...
## $ lastName: chr "Heard" "Taylor" "Donoho" "Gang" ...
## $ status : chr "N/A" "Employee" "Employee" "N/A" ...
The rest of this document explains, in 7 steps, how to handle the igraph SNA object and what you can do with it.
To insist on the requirements:
- the first two columns of the edges object must match node ids
- the nodes object must contain every node that appears in the edges object
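The two requirements can be checked before building the graph. The following is only a sketch on hypothetical miniature tables (`toy.nodes`, `toy.edges` and the example.com addresses are made up); the column names follow the ones used in this document.

```r
# Hypothetical miniature versions of the two data frames
toy.nodes <- data.frame(
  Email_id = c("a@example.com", "b@example.com", "c@example.com"),
  lastName = c("A", "B", "C"),
  stringsAsFactors = FALSE
)
toy.edges <- data.frame(
  sender   = c("a@example.com", "b@example.com"),
  receiver = c("b@example.com", "c@example.com"),
  stringsAsFactors = FALSE
)

# Ids that appear in the edge list but not in the node table
missing.ids <- setdiff(unique(c(toy.edges$sender, toy.edges$receiver)),
                       toy.nodes$Email_id)
length(missing.ids) == 0   # TRUE means every edge endpoint is a known node

# Node ids should also be unique
anyDuplicated(toy.nodes$Email_id) == 0
```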
When creating the graph we can choose if the network is directed or not. In this case we choose it as directed.
# important: for igraph V = vertex . E = edge Note uppercase
# We filtered out the full text for practical reasons, to make it simpler
network.full <- graph.data.frame(edges.full[, c("sender", "receiver", "type",
"date", "subject")], directed = TRUE, vertices = nodes)
class(network.full)
## [1] "igraph"
summary(network.full)
## IGRAPH DN-- 149 61673 --
## attr: name (v/c), lastName (v/c), status (v/c), type (e/c), date
## (e/c), subject (e/c)
# We have created an igraph object and summary will tell us the number of
# nodes and edges. igraph automatically sets as node properties all
# additional columns in node object (name, lastName, status) and as edge
# properties all additional columns apart from node id's (type, date, subject)
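As a small self-contained illustration of the same construction, here is a sketch with three made-up nodes. It uses `graph_from_data_frame()`, the name under which recent igraph versions provide `graph.data.frame()`; the objects `toy.nodes`, `toy.edges` and `toy.net` are hypothetical.

```r
library(igraph)

toy.nodes <- data.frame(
  Email_id = c("a@example.com", "b@example.com", "c@example.com"),
  status   = c("CEO", "Employee", "N/A"),
  stringsAsFactors = FALSE
)
toy.edges <- data.frame(
  sender   = c("a@example.com", "a@example.com", "b@example.com"),
  receiver = c("b@example.com", "b@example.com", "c@example.com"),
  type     = c("TO", "CC", "TO"),
  stringsAsFactors = FALSE
)

# First two edge columns are node ids; first node column becomes the name
toy.net <- graph_from_data_frame(toy.edges, directed = TRUE, vertices = toy.nodes)

vcount(toy.net)               # number of vertices
ecount(toy.net)               # number of edges; repeated pairs are kept
vertex_attr_names(toy.net)    # "name" plus the extra node columns
edge_attr_names(toy.net)      # the extra edge columns
```

Because repeated sender/receiver pairs are kept as separate edges, the resulting object is a multigraph in which each e-mail is one edge.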
Best documentation can be found at: http://igraph.sourceforge.net/doc/R/00Index.html http://igraph.sourceforge.net/documentation.html
And also from the unfinished tutorial: http://igraph.sourceforge.net/igraphbook/
# Node and edge properties are accessed with V(network) and E(network)
# http://igraph.sourceforge.net/doc/R/iterators.html
V(network.full)[1:10]
## Vertex sequence:
## [1] "marie.heard@enron.com" "mark.e.taylor@enron.com"
## [3] "lindy.donoho@enron.com" "lisa.gang@enron.com"
## [5] "jeff.skilling@enron.com" "lynn.blair@enron.com"
## [7] "kim.ward@enron.com" "kate.symes@enron.com"
## [9] "kay.mann@enron.com" "keith.holst@enron.com"
E(network.full)[1:10]
## Edge sequence:
##
## [1] mary.hain@enron.com -> sean.crandall@enron.com
## [2] mary.hain@enron.com -> mike.swerzbin@enron.com
## [3] mary.hain@enron.com -> robert.badeer@enron.com
## [4] cooper.richey@enron.com -> robert.badeer@enron.com
## [5] mary.hain@enron.com -> m..forney@enron.com
## [6] mary.hain@enron.com -> robert.badeer@enron.com
## [7] mary.hain@enron.com -> mike.swerzbin@enron.com
## [8] jeff.dasovich@enron.com -> james.d.steffes@enron.com
## [9] jeff.dasovich@enron.com -> richard.shapiro@enron.com
## [10] jeff.dasovich@enron.com -> james.d.steffes@enron.com
# Their attributes are accessed with $
table(V(network.full)$status)
##
## CEO Director Employee In House Lawyer
## 4 14 41 1
## Manager Managing Director N/A President
## 14 3 32 4
## Trader Vice President
## 13 23
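Vertex and edge sequences can also be filtered by attribute. The sketch below uses a made-up three-node graph (`toy.net` and its attribute values are hypothetical) to show the indexing syntax.

```r
library(igraph)

toy.net <- graph_from_data_frame(
  data.frame(from = c("a", "a", "b"), to = c("b", "c", "c"),
             type = c("TO", "CC", "TO"), stringsAsFactors = FALSE),
  directed = TRUE,
  vertices = data.frame(name = c("a", "b", "c"),
                        status = c("CEO", "Employee", "Employee"),
                        stringsAsFactors = FALSE)
)

# Attributes are read (and written) with the $ operator
V(toy.net)$status
E(toy.net)$type

# Conditions on attributes inside [] select vertices or edges
ceo.vertices <- V(toy.net)[status == "CEO"]
to.edges     <- E(toy.net)[type == "TO"]

ceo.vertices$name    # names of the selected vertices
length(to.edges)     # number of selected edges
```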
Take care with the date format: Gephi requires it to be a string.
write.graph(network.full, file = "network01.graphml", format = "graphml")
With igraph's get.shortest.paths you can obtain the shortest paths between two nodes.
Thanks to the explanation at:
http://sigloxxi.fcie.uam.es/informatica/media/Grafos%20con%20R%20e%20Igraph.pdf
get.shortest.paths(from = V(network.full)$lastName == "Pereira", to = V(network.full)$lastName ==
"Horton", graph = network.full)
## $vpath
## $vpath[[1]]
## [1] 138 11 132
##
##
## $epath
## NULL
##
## $predecessors
## NULL
##
## $inbound_edges
## NULL
nodes[c(138, 11, 132), ]
## Email_id lastName status
## 138 susan.w.pereira@enron.com Pereira Employee
## 11 kenneth.lay@enron.com Lay CEO
## 132 stanley.horton@enron.com Horton President
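The lookup of vertex ids in the nodes table can be avoided by asking igraph for the names directly. This is a sketch on a made-up graph, using `shortest_paths()` and `as_ids()` from recent igraph versions; `toy.net` and its vertex names are hypothetical.

```r
library(igraph)

# Short route a -> b -> c plus a longer detour a -> d -> e -> c
toy.net <- graph_from_data_frame(
  data.frame(from = c("a", "b", "a", "d", "e"),
             to   = c("b", "c", "d", "e", "c"),
             stringsAsFactors = FALSE),
  directed = TRUE
)

# shortest_paths() is the current name of get.shortest.paths()
sp <- shortest_paths(toy.net, from = "a", to = "c")

# vpath is a list with one vertex sequence per target vertex
as_ids(sp$vpath[[1]])
```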
The diameter of a graph is the length of the longest shortest path between any two nodes.
diameter(network.full)
## [1] 5
# farthest.nodes() returns the two end vertices followed by their distance,
# so only the first two elements are row indices
nodes[farthest.nodes(network.full)[1:2], ]
## Email_id lastName status
## 13 joe.quenet@enron.com Quenet Trader
## 4 lisa.gang@enron.com Gang N/A
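In recent igraph versions the end points and the distance come back as separate list elements from `farthest_vertices()`. A minimal sketch on a made-up directed path (`toy.net` is hypothetical):

```r
library(igraph)

# Directed path a -> b -> c -> d
toy.net <- graph_from_data_frame(
  data.frame(from = c("a", "b", "c"), to = c("b", "c", "d"),
             stringsAsFactors = FALSE),
  directed = TRUE
)

toy.diameter <- diameter(toy.net)   # length of the longest shortest path

# End points and distance are returned as separate list elements
far <- farthest_vertices(toy.net)
as_ids(far$vertices)
far$distance
```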
Centrality measures can be computed and added to the node properties table. The basic centrality measure is degree: in-degree and out-degree (this is a directed graph), plus total degree.
nodes$degree_total <- degree(network.full, v = V(network.full), mode = c("total"))
nodes$degree_in <- degree(network.full, v = V(network.full), mode = c("in"))
nodes$degree_out <- degree(network.full, v = V(network.full), mode = c("out"))
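A small sketch of how the three degree variants relate, on a made-up graph with a repeated edge (`toy.net` is hypothetical). It illustrates that, when every e-mail is one edge, degree counts messages rather than distinct contacts.

```r
library(igraph)

toy.net <- graph_from_data_frame(
  data.frame(from = c("a", "a", "b", "c"),
             to   = c("b", "b", "c", "a"),
             stringsAsFactors = FALSE),
  directed = TRUE
)

deg.in    <- degree(toy.net, mode = "in")
deg.out   <- degree(toy.net, mode = "out")
deg.total <- degree(toy.net, mode = "total")

# Every edge counts once for its sender and once for its receiver
all(deg.total == deg.in + deg.out)

# The two a -> b edges are counted separately in the out-degree of "a"
deg.out["a"]
```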
Let’s see the top 20 for each measure. For total degree (in plus out):
head(nodes[order(nodes$degree_total, decreasing = TRUE), ], n = 20L)
## Email_id lastName status degree_total
## 42 jeff.dasovich@enron.com Dasovich Employee 8610
## 36 james.d.steffes@enron.com Steffes Vice President 5720
## 141 tana.jones@enron.com Jones N/A 5190
## 99 mike.grigsby@enron.com Grigsby Manager 4709
## 125 sara.shackleton@enron.com Shackleton N/A 4708
## 116 richard.shapiro@enron.com Shapiro Vice President 4327
## 134 steven.j.kean@enron.com Kean Vice President 4046
## 2 mark.e.taylor@enron.com Taylor Employee 3477
## 14 louise.kitchen@enron.com Kitchen President 3241
## 55 carol.clair@enron.com Clair Vice President 3114
## 17 kimberly.watson@enron.com Watson N/A 2091
## 133 stephanie.panus@enron.com Panus Employee 2063
## 1 marie.heard@enron.com Heard N/A 2048
## 140 susan.bailey@enron.com Bailey N/A 1918
## 16 liz.taylor@enron.com Taylor N/A 1890
## 115 richard.b.sanders@enron.com Sanders Vice President 1813
## 139 susan.scott@enron.com Scott N/A 1800
## 98 michelle.lokay@enron.com Lokay Employee 1658
## 135 steven.harris@enron.com Harris Vice President 1628
## 93 mary.hain@enron.com Hain N/A 1622
## degree_in degree_out
## 42 1499 7111
## 36 2991 2729
## 141 1633 3557
## 99 693 4016
## 125 2211 2497
## 116 3276 1051
## 134 2476 1570
## 2 2422 1055
## 14 1123 2118
## 55 937 2177
## 17 940 1151
## 133 869 1194
## 1 1066 982
## 140 1486 432
## 16 129 1761
## 115 1332 481
## 139 876 924
## 98 672 986
## 135 1353 275
## 93 461 1161
For in-degree:
head(nodes[order(nodes$degree_in, decreasing = TRUE), ], n = 20L)
## Email_id lastName status degree_total
## 116 richard.shapiro@enron.com Shapiro Vice President 4327
## 36 james.d.steffes@enron.com Steffes Vice President 5720
## 134 steven.j.kean@enron.com Kean Vice President 4046
## 2 mark.e.taylor@enron.com Taylor Employee 3477
## 125 sara.shackleton@enron.com Shackleton N/A 4708
## 141 tana.jones@enron.com Jones N/A 5190
## 42 jeff.dasovich@enron.com Dasovich Employee 8610
## 140 susan.bailey@enron.com Bailey N/A 1918
## 135 steven.harris@enron.com Harris Vice President 1628
## 115 richard.b.sanders@enron.com Sanders Vice President 1813
## 49 barry.tycholiz@enron.com Tycholiz Vice President 1494
## 14 louise.kitchen@enron.com Kitchen President 3241
## 1 marie.heard@enron.com Heard N/A 2048
## 17 kimberly.watson@enron.com Watson N/A 2091
## 55 carol.clair@enron.com Clair Vice President 3114
## 139 susan.scott@enron.com Scott N/A 1800
## 133 stephanie.panus@enron.com Panus Employee 2063
## 75 elizabeth.sager@enron.com Sager Employee 1135
## 110 phillip.k.ellen@enron.com Allen Manager 1250
## 96 matthew.lenhart@enron.com Lenhart Employee 1309
## degree_in degree_out
## 116 3276 1051
## 36 2991 2729
## 134 2476 1570
## 2 2422 1055
## 125 2211 2497
## 141 1633 3557
## 42 1499 7111
## 140 1486 432
## 135 1353 275
## 115 1332 481
## 49 1181 313
## 14 1123 2118
## 1 1066 982
## 17 940 1151
## 55 937 2177
## 139 876 924
## 133 869 1194
## 75 816 319
## 110 785 465
## 96 775 534
For out-degree:
head(nodes[order(nodes$degree_out, decreasing = TRUE), ], n = 20L)
## Email_id lastName status degree_total
## 42 jeff.dasovich@enron.com Dasovich Employee 8610
## 99 mike.grigsby@enron.com Grigsby Manager 4709
## 141 tana.jones@enron.com Jones N/A 5190
## 36 james.d.steffes@enron.com Steffes Vice President 5720
## 125 sara.shackleton@enron.com Shackleton N/A 4708
## 55 carol.clair@enron.com Clair Vice President 3114
## 14 louise.kitchen@enron.com Kitchen President 3241
## 16 liz.taylor@enron.com Taylor N/A 1890
## 134 steven.j.kean@enron.com Kean Vice President 4046
## 133 stephanie.panus@enron.com Panus Employee 2063
## 93 mary.hain@enron.com Hain N/A 1622
## 17 kimberly.watson@enron.com Watson N/A 2091
## 123 sally.beck@enron.com Beck Employee 1313
## 2 mark.e.taylor@enron.com Taylor Employee 3477
## 116 richard.shapiro@enron.com Shapiro Vice President 4327
## 73 drew.fossum@enron.com Fossum Vice President 1331
## 98 michelle.lokay@enron.com Lokay Employee 1658
## 1 marie.heard@enron.com Heard N/A 2048
## 57 chris.germany@enron.com Germany Employee 1086
## 139 susan.scott@enron.com Scott N/A 1800
## degree_in degree_out
## 42 1499 7111
## 99 693 4016
## 141 1633 3557
## 36 2991 2729
## 125 2211 2497
## 55 937 2177
## 14 1123 2118
## 16 129 1761
## 134 2476 1570
## 133 869 1194
## 93 461 1161
## 17 940 1151
## 123 252 1061
## 2 2422 1055
## 116 3276 1051
## 73 320 1011
## 98 672 986
## 1 1066 982
## 57 131 955
## 139 876 924
Reach, computed with neighborhood.size, is another measure. You must specify an order (an integer); the result is the number of people that can be reached within that many steps, counting the node itself. This metric is closely linked to actual connectivity.
nodes$reach_2_step <- neighborhood.size(network.full, order = 2, nodes = V(network.full),
mode = c("all"))
head(nodes[order(nodes$reach_2_step, decreasing = TRUE), ], n = 30L)
## Email_id lastName status degree_total
## 15 kevin.m.presto@enron.com Presto Vice President 1146
## 16 liz.taylor@enron.com Taylor N/A 1890
## 11 kenneth.lay@enron.com Lay CEO 597
## 26 lavorato@enron.com Lavorato CEO 377
## 123 sally.beck@enron.com Beck Employee 1313
## 14 louise.kitchen@enron.com Kitchen President 3241
## 36 james.d.steffes@enron.com Steffes Vice President 5720
## 68 david.w.delainey@enron.com Delainey CEO 1078
## 110 phillip.k.ellen@enron.com Allen Manager 1250
## 134 steven.j.kean@enron.com Kean Vice President 4046
## 24 m..forney@enron.com Forney Manager 289
## 49 barry.tycholiz@enron.com Tycholiz Vice President 1494
## 88 e..haedicke@enron.com Haedicke Managing Director 1176
## 117 rick.buy@enron.com Buy Manager 439
## 5 jeff.skilling@enron.com Skilling CEO 242
## 63 dana.davis@enron.com Davis Vice President 261
## 85 greg.whalley@enron.com Whalley President 833
## 93 mary.hain@enron.com Hain N/A 1622
## 99 mike.grigsby@enron.com Grigsby Manager 4709
## 116 richard.shapiro@enron.com Shapiro Vice President 4327
## 139 susan.scott@enron.com Scott N/A 1800
## 10 keith.holst@enron.com Holst Director 638
## 25 john.arnold@enron.com Arnold Manager 969
## 75 elizabeth.sager@enron.com Sager Employee 1135
## 80 fletcher.j.sturm@enron.com Sturm Vice President 389
## 141 tana.jones@enron.com Jones N/A 5190
## 148 j.kaminski@enron.com Kaminski Manager 451
## 2 mark.e.taylor@enron.com Taylor Employee 3477
## 97 michelle.cash@enron.com Cash Employee 245
## 115 richard.b.sanders@enron.com Sanders Vice President 1813
## degree_in degree_out reach_2_step
## 15 459 687 146
## 16 129 1761 145
## 11 210 387 144
## 26 6 371 144
## 123 252 1061 143
## 14 1123 2118 142
## 36 2991 2729 142
## 68 556 522 142
## 110 785 465 142
## 134 2476 1570 142
## 24 106 183 141
## 49 1181 313 141
## 88 695 481 141
## 117 328 111 141
## 5 141 101 140
## 63 244 17 140
## 85 769 64 140
## 93 461 1161 140
## 99 693 4016 140
## 116 3276 1051 140
## 139 876 924 140
## 10 614 24 139
## 25 495 474 139
## 75 816 319 139
## 80 256 133 139
## 141 1633 3557 139
## 148 104 347 139
## 2 2422 1055 138
## 97 130 115 138
## 115 1332 481 138
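To make the meaning of the order parameter concrete, here is a sketch on a made-up four-node path, using `ego_size()`, the name under which recent igraph versions provide `neighborhood.size()`. The graph `toy.net` is hypothetical.

```r
library(igraph)

# Directed path a -> b -> c -> d; mode = "all" ignores edge direction
toy.net <- graph_from_data_frame(
  data.frame(from = c("a", "b", "c"), to = c("b", "c", "d"),
             stringsAsFactors = FALSE),
  directed = TRUE
)

# The count includes the starting vertex itself
reach.1 <- ego_size(toy.net, order = 1, nodes = V(toy.net), mode = "all")
reach.2 <- ego_size(toy.net, order = 2, nodes = V(toy.net), mode = "all")
names(reach.1) <- names(reach.2) <- V(toy.net)$name

reach.1   # end points reach 2 vertices, inner vertices reach 3
reach.2   # end points reach 3 vertices, inner vertices reach all 4
```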
There is a lot of information available about Enron employees, e.g. http://www.inf.ed.ac.uk/teaching/courses/tts/assessed/roles.txt
Other interesting measures are the clustering coefficient and transitivity (http://en.wikipedia.org/wiki/Clustering_coefficient): “The clustering coefficient places more weight on the low degree nodes, while the transitivity ratio places more weight on the high degree nodes”.
nodes$transitivity_ratio <- transitivity(network.full, vids = V(network.full),
type = "local")
head(nodes[order(nodes$transitivity_ratio, decreasing = FALSE), ], n = 20L)
## Email_id lastName status degree_total
## 139 susan.scott@enron.com Scott N/A 1800
## 16 liz.taylor@enron.com Taylor N/A 1890
## 26 lavorato@enron.com Lavorato CEO 377
## 123 sally.beck@enron.com Beck Employee 1313
## 57 chris.germany@enron.com Germany Employee 1086
## 11 kenneth.lay@enron.com Lay CEO 597
## 7 kim.ward@enron.com Ward N/A 1000
## 65 daren.j.farmer@enron.com Farmer Manager 105
## 24 m..forney@enron.com Forney Manager 289
## 52 bill.williams@enron.com Williams N/A 381
## 14 louise.kitchen@enron.com Kitchen President 3241
## 22 kam.keiser@enron.com Keiser Employee 1081
## 84 gerald.nemec@enron.com Nemec N/A 1294
## 56 charles.weldon@enron.com Weldon N/A 99
## 93 mary.hain@enron.com Hain N/A 1622
## 42 jeff.dasovich@enron.com Dasovich Employee 8610
## 62 dan.hyvl@enron.com Hyvl Employee 531
## 15 kevin.m.presto@enron.com Presto Vice President 1146
## 69 debra.perlingiere@enron.com Perlingiere Employee 972
## 99 mike.grigsby@enron.com Grigsby Manager 4709
## degree_in degree_out reach_2_step transitivity_ratio
## 139 876 924 140 0.2116
## 16 129 1761 145 0.2277
## 26 6 371 144 0.2344
## 123 252 1061 143 0.2595
## 57 131 955 130 0.2667
## 11 210 387 144 0.2956
## 7 450 550 135 0.3085
## 65 82 23 123 0.3143
## 24 106 183 141 0.3233
## 52 105 276 117 0.3238
## 14 1123 2118 142 0.3295
## 22 289 792 126 0.3298
## 84 593 701 134 0.3485
## 56 63 36 126 0.3509
## 93 461 1161 140 0.3655
## 42 1499 7111 136 0.3684
## 62 235 296 106 0.3766
## 15 459 687 146 0.3861
## 69 284 688 110 0.3875
## 99 693 4016 140 0.3932
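The difference between the local measure computed above and the global transitivity ratio can be seen on a made-up graph: a triangle with one extra vertex attached. This is only an illustrative sketch; `toy.net` is hypothetical.

```r
library(igraph)

# Triangle a-b-c plus a pendant vertex d attached to a
toy.net <- graph_from_data_frame(
  data.frame(from = c("a", "b", "c", "a"),
             to   = c("b", "c", "a", "d"),
             stringsAsFactors = FALSE),
  directed = FALSE
)

# Local value: share of a vertex's neighbour pairs that are themselves linked
local.cc <- transitivity(toy.net, type = "local")
names(local.cc) <- V(toy.net)$name
local.cc    # a = 1/3, b = c = 1, d = NaN (fewer than two neighbours)

# Global ratio: closed triplets over all connected triplets
global.cc <- transitivity(toy.net, type = "global")
global.cc   # 3 closed out of 5 triplets = 0.6
```

Vertices with fewer than two neighbours get NaN in the local measure, which is worth keeping in mind when sorting a table by this column.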
V(network.full)$outdegree <- degree(network.full, mode = "out")
V(network.full)$indegree <- degree(network.full, mode = "in")
V(network.full)$degree <- degree(network.full, mode = "all")
V(network.full)$reach_2_step <- neighborhood.size(network.full, order = 2, nodes = V(network.full),
mode = c("all"))
V(network.full)$transitivity_ratio <- transitivity(network.full, vids = V(network.full),
type = "local")
Extracting parts of a graph with igraph is very easy. You just need to know two functions: induced.subgraph, which selects by vertices, and subgraph.edges, which selects by edges.
For instance, to extract the subgraph covering the period that led to Enron's bankruptcy (based on the information available at http://es.wikipedia.org/wiki/Enron#Ca.C3.ADda_de_la_empresa; the page is in Spanish, switch to the English version if needed):
edges.full$day <- strftime(edges.full$date.R, "%Y-%m-%d")
network.august <- subgraph.edges(network.full, which(as.Date(E(network.full)$date) >
"2001-02-12 00:00:00"), delete.vertices = TRUE)
summary(network.august)
## IGRAPH DN-- 146 42126 --
## attr: name (v/c), lastName (v/c), status (v/c), outdegree (v/n),
## indegree (v/n), degree (v/n), reach_2_step (v/n),
## transitivity_ratio (v/n), type (e/c), date (e/c), subject (e/c)
write.graph(network.august, file = "network2001onwards.graphml", format = "graphml")
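The same date filter on a made-up graph, as a sketch. Here the cutoff is converted to a Date explicitly so that both sides of the comparison have the same type; `toy.net`, `cutoff` and `toy.recent` are hypothetical names. `subgraph.edges()` is used as in this document and may emit a deprecation warning in recent igraph releases.

```r
library(igraph)

toy.net <- graph_from_data_frame(
  data.frame(from = c("a", "b", "c"),
             to   = c("b", "c", "d"),
             date = c("2000-12-01 09:00:00", "2001-03-15 10:30:00",
                      "2001-11-13 11:38:01"),
             stringsAsFactors = FALSE),
  directed = TRUE
)

# Compare Date with Date so the cutoff is explicit
cutoff <- as.Date("2001-02-12")
keep   <- which(as.Date(E(toy.net)$date) > cutoff)

# Vertices left without any edge are dropped (delete.vertices = TRUE)
toy.recent <- subgraph.edges(toy.net, keep, delete.vertices = TRUE)
ecount(toy.recent)
V(toy.recent)$name
```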
For instance, let’s look at the messages sent or received by CEO Kenneth Lay after 1 July 2001:
mails.lay <- edges.full[(edges.full$sender == "kenneth.lay@enron.com" & as.Date(edges.full$date.R) >
"2001-07-01 00:00:00") | (edges.full$receiver == "kenneth.lay@enron.com" &
as.Date(edges.full$date.R) > "2001-07-01 00:00:00"), ]
mails.lay <- mails.lay[order(as.Date(mails.lay$date.R)), ]
nrow(mails.lay)
## [1] 506
Note how employees were unaware until the last minute of what was going on, in spite of everything they had at stake in the company's performance. Of course, it all depended on your position in the company:
mails.lay[rownames(mails.lay) == 3473, ]
## sender receiver type
## 3473 susan.w.pereira@enron.com kenneth.lay@enron.com TO
## subject
## 3473 ENRON - A Study in How So Few Could Screw So Many
## body
## 3473 Mr. Lay-After reading the news of your $60,000,000 to $80,000,000 payout, I am disg=usted and appalled all over again. Unfortunately, that s a daily occurrenc=e.=20I have been employed with Enron since May of 1993, when Enron purchased LRC=. I was very skeptical of Enron back then and was unsure of my future with= the company. As time went on, I got more comfortable with the Enron way a=nd learned how to survive and even be a little successful here. I became a= believer in the things Enron could accomplish, a champion of our hard-nose=d, driven executives (at least some of them). I bought Enron stock and hel=d on to those valuable options.Over the last 18 months, my coworkers and I have viewed the weekly, sometim=es daily, selling of Enron stock and exercising of options by our top execu=tives (past and present and including you). We came up with all kinds of r=easons that the executives would be doing this -- they re so overly compens=ated that they have to cash out some every now and then, divorce settlement=s, mistress settlements, buying a new home in Aspen, buying an island in th=e Caribbean, etc., etc. etc.... We never wanted to admit that they knew so=mething we didn t. Things were great, weren t they? Jeff Skilling told us= that Enron was going to be the "World s Leading Company." He even put the= goofy acronyms on his car license plate. He told us gas traders in Februa=ry how the stock was going to $126 by the end of the year. Now, being the c=ynic that I am, I didn t believe the $126, although I have to admit I was h=opeful; I figured that if Jeff had the audacity to throw out a number that =high, then it was reasonable to expect the stock to be fairly stable, i.e.,= +/- 20%. =20Needless to say, I didn t sell my stock and didn t exercise any more option=s. In fact, I bought more stock when it first started going down. I m afr=aid that there are many more just like me. 
I m fortunate in that I have ma=ny working years ahead of me (where, I don t know) to try to build up my sa=vings again. Many others are not that fortunate. So many have spent their= entire careers here, helping to build this company up. They were looking =forward to retiring soon and enjoying the fruits of their labor. And let m=e remind you that their retirement accounts were in many cases a lot less t=han a month s compensation for you. Now even that is essentially gone. Ot=her employees were just ready to cash in some of their options to pay for t=heir children s college tuition. The stories are too numerous to list, and= the more I think about it, the more sickened I become.It is painfully obvious to me and my coworkers, as well as the rest of the =industry and Houston, that Enron s executives knew that there were skeleton=s in the closet and began cashing in ahead of this freefall. The employees= and the rest of the world were fed a bunch of half-truths and mystery mumb=o-jumbo. There should be an accounting for this behavior. You and your co=horts have ruined so many lives. Think about that while you re spending yo=ur millions.....Susan PereiraENA Gas Trader
## date date.R day
## 3473 2001-11-13 11:38:01 2001-11-13 11:38:01 2001-11-13
mails.lay[rownames(mails.lay) == 60469, ]
## sender receiver type subject
## 60469 stanley.horton@enron.com kenneth.lay@enron.com TO Difficult times
## body
## 60469 I just wanted to let you know that if there is anything I can do to help I am more than willing to do it. These are difficult times and I am doing alot of floor meetings and table talks with the employees. They clearly do not understand how we got into this situation and just want some face time with Management. With the exception of NEPCO our fourth quarter looks good. I have uncovered alot of issues with NEPCO that I do not think anyone knew existed. I ll know more Friday after a businnes/budget review session.Please let me know if I can help.Stan
## date date.R day
## 60469 2001-10-31 05:57:26 2001-10-31 05:57:26 2001-10-31
Another way of extracting a subgraph: all nodes who had contact with Kenneth Lay.
nodes.with.lay <- unique(c(mails.lay$sender, mails.lay$receiver))
network.kenneth.lay <- graph.data.frame(mails.lay[, c("sender", "receiver",
"type", "date", "subject")], directed = TRUE)
summary(network.kenneth.lay)
## IGRAPH DN-- 61 506 --
## attr: name (v/c), type (e/c), date (e/c), subject (e/c)
Now let's see how many people were in Lay’s neighbourhood. As CEO, he could reach almost the whole company in only two steps.
neighborhood.size(network.full, 1, V(network.full)$lastName == "Lay")
## [1] 63
neighborhood.size(network.full, 2, V(network.full)$lastName == "Lay")
## [1] 144
There are recommendations about which community detection algorithms can be used depending on the type of graph (directed vs. undirected): http://igraph.wikidot.com/community-detection-in-r
A classical algorithm is the one by Vincent D. Blondel, Jean-Loup Guillaume, Renaud Lambiotte and Etienne Lefebvre, “Fast unfolding of communities in large networks”, Journal of Statistical Mechanics: Theory and Experiment 2008 (10), P10008. It is available in igraph as the function multilevel.community:
http://igraph.sourceforge.net/doc/R/multilevel.community.html
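Since this algorithm works on undirected graphs, a directed e-mail multigraph has to be collapsed first. The sketch below shows one possible way to do that on a made-up graph, summing the number of messages per pair into a weight. It is an assumption for illustration and not necessarily how the network.social object used below was built; `toy.net` and `toy.social` are hypothetical names, and `as.undirected()` may emit a deprecation warning in recent igraph releases.

```r
library(igraph)

toy.net <- graph_from_data_frame(
  data.frame(from = c("a", "a", "b", "c"),
             to   = c("b", "b", "a", "a"),
             stringsAsFactors = FALSE),
  directed = TRUE
)

# Give every message weight 1, then merge all edges between the same pair
E(toy.net)$weight <- 1
toy.social <- as.undirected(toy.net, mode = "collapse",
                            edge.attr.comb = list(weight = "sum", "ignore"))

is_directed(toy.social)    # FALSE
ecount(toy.social)         # one edge per pair of people
E(toy.social)$weight       # messages exchanged per pair, in either direction
```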
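# !!! need to review this
# network.social is not built in this document; its summary further below
# shows an undirected, weighted graph (multilevel.community needs undirected input)
communities <- multilevel.community(network.social)
#str(communities)
# !!! need to review this
comms.df <- data.frame(row.names = seq_len(149))
comms.df$Email_id <- communities$names
comms.df$community <- communities$membership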
# Adding each node's community to the nodes table
str(nodes)
## 'data.frame': 149 obs. of 8 variables:
## $ Email_id : chr "marie.heard@enron.com" "mark.e.taylor@enron.com" "lindy.donoho@enron.com" "lisa.gang@enron.com" ...
## $ lastName : chr "Heard" "Taylor" "Donoho" "Gang" ...
## $ status : chr "N/A" "Employee" "Employee" "N/A" ...
## $ degree_total : num 2048 3477 1217 80 242 ...
## $ degree_in : num 1066 2422 620 71 141 ...
## $ degree_out : num 982 1055 597 9 101 ...
## $ reach_2_step : num 124 138 96 30 140 91 135 86 132 139 ...
## $ transitivity_ratio: num 0.732 0.398 0.81 0.9 0.509 ...
nodes.def <- merge(nodes, comms.df, by.x = "Email_id", by.y = "Email_id")
str(nodes.def)
## 'data.frame': 149 obs. of 9 variables:
## $ Email_id : chr "albert.meyers@enron.com" "andrea.ring@enron.com" "andy.zipper@enron.com" "barry.tycholiz@enron.com" ...
## $ lastName : chr "Meyers" "Ring" "Zipper" "Tycholiz" ...
## $ status : chr "Employee" "N/A" "Vice President" "Vice President" ...
## $ degree_total : num 38 142 529 1494 96 ...
## $ degree_in : num 31 97 327 1181 67 ...
## $ degree_out : num 7 45 202 313 29 283 276 21 0 325 ...
## $ reach_2_step : num 89 125 136 141 118 74 117 120 1 63 ...
## $ transitivity_ratio: num 0.833 0.463 0.511 0.41 0.533 ...
## $ community : num 4 8 8 5 8 10 4 8 3 4 ...
head(nodes.def)
## Email_id lastName status degree_total degree_in
## 1 albert.meyers@enron.com Meyers Employee 38 31
## 2 andrea.ring@enron.com Ring N/A 142 97
## 3 andy.zipper@enron.com Zipper Vice President 529 327
## 4 barry.tycholiz@enron.com Tycholiz Vice President 1494 1181
## 5 benjamin.rogers@enron.com Rogers Employee 96 67
## 6 bill.rapp@enron.com Rapp N/A 434 151
## degree_out reach_2_step transitivity_ratio community
## 1 7 89 0.8333 4
## 2 45 125 0.4632 8
## 3 202 136 0.5111 8
## 4 313 141 0.4103 5
## 5 29 118 0.5333 8
## 6 283 74 0.7333 10
plot(table(nodes.def$community))
V(network.social)$community <- communities$membership
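A minimal sketch of community detection on a made-up graph with an obvious two-group structure. It uses `cluster_louvain()`, the name under which recent igraph versions provide `multilevel.community()`; `toy.net` and `toy.comm` are hypothetical.

```r
library(igraph)

# Two triangles (a-b-c and d-e-f) joined by a single bridge edge c-d
toy.net <- graph_from_data_frame(
  data.frame(from = c("a", "b", "c", "d", "e", "f", "c"),
             to   = c("b", "c", "a", "e", "f", "d", "d"),
             stringsAsFactors = FALSE),
  directed = FALSE
)

# The algorithm accepts undirected graphs only
toy.comm <- cluster_louvain(toy.net)

toy.membership <- membership(toy.comm)   # community id per vertex
toy.membership
sizes(toy.comm)                          # number of vertices per community
modularity(toy.comm)

# Store the result as a vertex attribute, as done above
V(toy.net)$community <- as.integer(toy.membership)
```

The numeric community labels are arbitrary, so comparisons should rely on which vertices share a label rather than on the label values.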
There are currently three different functions in the igraph package which can draw graphs in various ways: plot.igraph, tkplot and rglplot. Only the first one is used here.
plot.igraph does simple non-interactive 2D plotting to R devices. It is actually an implementation of the plot generic function, so you can write plot(graph) instead of plot.igraph(graph). As it uses the standard R devices, it supports every output format for which R has an output device. The list is quite impressive: PostScript, PDF files, XFig files, SVG files, JPG, PNG, and of course you can plot to the screen as well using the default devices or the good-looking anti-aliased Cairo device.
See plot.igraph for more information. However, unless you tune it further, the basic plot is unusable, in particular for large graphs like the one we are dealing with.
plot(network.social)
First recommendation: plotting a large graph with igraph is of little use, and the Enron graph is not even huge. Gephi is an excellent alternative for high-quality interactive plots.
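If a static igraph plot is still wanted, a few plot parameters help somewhat. The following sketch draws a made-up ring graph to a PDF file in the temporary directory; the graph, the file name and the parameter values are assumptions chosen for illustration.

```r
library(igraph)

toy.net <- make_ring(10)

# Write to a PDF device; small vertices and no labels reduce clutter
out.file <- file.path(tempdir(), "toy_plot.pdf")
pdf(out.file, width = 7, height = 7)
plot(toy.net,
     layout = layout_with_fr(toy.net),   # force-directed layout
     vertex.size = 5,
     vertex.label = NA,
     edge.arrow.size = 0.3)
dev.off()

file.exists(out.file)
```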
To show igraph's plotting capabilities the graph should be small. Let’s extract the “CEOs' community”:
str(nodes.def)
## 'data.frame': 149 obs. of 9 variables:
## $ Email_id : chr "albert.meyers@enron.com" "andrea.ring@enron.com" "andy.zipper@enron.com" "barry.tycholiz@enron.com" ...
## $ lastName : chr "Meyers" "Ring" "Zipper" "Tycholiz" ...
## $ status : chr "Employee" "N/A" "Vice President" "Vice President" ...
## $ degree_total : num 38 142 529 1494 96 ...
## $ degree_in : num 31 97 327 1181 67 ...
## $ degree_out : num 7 45 202 313 29 283 276 21 0 325 ...
## $ reach_2_step : num 89 125 136 141 118 74 117 120 1 63 ...
## $ transitivity_ratio: num 0.833 0.463 0.511 0.41 0.533 ...
## $ community : num 4 8 8 5 8 10 4 8 3 4 ...
nodes.def[nodes.def$lastName == "Lay", ]
## Email_id lastName status degree_total degree_in degree_out
## 72 kenneth.lay@enron.com Lay CEO 597 210 387
## reach_2_step transitivity_ratio community
## 72 144 0.2956 8
nodes.def[nodes.def$community == 8, c(2:9)]
## lastName status degree_total degree_in degree_out
## 2 Ring N/A 142 97 45
## 3 Zipper Vice President 529 327 202
## 5 Rogers Employee 96 67 29
## 8 Mckay Employee 147 126 21
## 12 Weldon N/A 99 63 36
## 13 Dorland Manager 243 80 163
## 14 Germany Employee 1086 131 955
## 15 Stokley N/A 60 56 4
## 16 Richey Manager 97 91 6
## 19 Davis Vice President 261 244 17
## 21 Farmer Manager 105 82 23
## 23 Delainey CEO 1078 556 522
## 26 Baughman Trader 346 112 234
## 27 Gilbert-smith Employee 350 288 62
## 29 Quigley Trader 528 378 150
## 34 Saibi Trader 168 160 8
## 35 McLaughlin Employee 845 275 570
## 36 Sturm Vice President 389 256 133
## 39 Storey Director 212 181 31
## 41 Whalley President 833 769 64
## 43 Arora Vice President 130 115 15
## 45 Shively Vice President 605 472 133
## 46 Kaminski Manager 451 104 347
## 54 King Manager 104 94 10
## 55 Skilling CEO 242 141 101
## 56 Shankman President 512 296 216
## 57 Schwieger Trader 163 66 97
## 58 Parks N/A 162 137 25
## 59 Quenet Trader 38 32 6
## 60 Stepenovitch Vice President 100 81 19
## 61 Arnold Manager 969 495 474
## 62 Griffith Manager 503 245 258
## 63 Hodge Managing Director 160 121 39
## 64 Zufferli Vice President 213 170 43
## 65 Mckay Director 242 154 88
## 66 Hernandez Employee 135 108 27
## 67 Townsend Employee 415 395 20
## 72 Lay CEO 597 210 387
## 74 Presto Vice President 1146 459 687
## 75 Ruscitti Trader 81 68 13
## 79 Campbell Employee 158 111 47
## 80 May Director 357 272 85
## 81 Lavorato CEO 377 6 371
## 84 Taylor N/A 1890 129 1761
## 85 Kitchen President 3241 1123 2118
## 87 Forney Manager 289 106 183
## 101 Carson Manager 133 118 15
## 103 Maggi Director 344 331 13
## 104 Swerzbin Trader 171 158 13
## 108 Thomas N/A 104 69 35
## 111 Keavey Employee 143 102 41
## 116 Ring Employee 44 37 7
## 118 Buy Manager 439 328 111
## 120 Benson Director 124 120 4
## 121 Rodrigue N/A 61 19 42
## 124 Beck Employee 1313 252 1061
## 125 Brawner Director 222 170 52
## 127 Hendrickson N/A 221 184 37
## 128 Neal Vice President 879 429 450
## 131 White N/A 278 189 89
## 145 Martin Vice President 365 310 55
## 146 Donohoe Employee 37 29 8
## 149 Pimenov N/A 90 76 14
## reach_2_step transitivity_ratio community
## 2 125 0.4632 8
## 3 136 0.5111 8
## 5 118 0.5333 8
## 8 120 0.5108 8
## 12 126 0.3509 8
## 13 130 0.4444 8
## 14 130 0.2667 8
## 15 122 0.5091 8
## 16 125 0.4615 8
## 19 140 0.5167 8
## 21 123 0.3143 8
## 23 142 0.4634 8
## 26 136 0.5556 8
## 27 137 0.5524 8
## 29 127 0.4238 8
## 34 135 0.6944 8
## 35 117 0.5619 8
## 36 139 0.4354 8
## 39 126 0.4123 8
## 41 140 0.5947 8
## 43 131 0.6199 8
## 45 137 0.4297 8
## 46 139 0.4824 8
## 54 130 0.7308 8
## 55 140 0.5095 8
## 56 136 0.6527 8
## 57 129 0.4895 8
## 58 127 0.4231 8
## 59 116 0.7121 8
## 60 102 0.7619 8
## 61 139 0.4836 8
## 62 121 0.4420 8
## 63 97 0.5273 8
## 64 126 0.5359 8
## 65 126 0.4375 8
## 66 126 0.7556 8
## 67 117 0.6889 8
## 72 144 0.2956 8
## 74 146 0.3861 8
## 75 109 0.5000 8
## 79 106 0.4889 8
## 80 131 0.5714 8
## 81 144 0.2344 8
## 84 145 0.2277 8
## 85 142 0.3295 8
## 87 141 0.3233 8
## 101 131 0.6813 8
## 103 133 0.5809 8
## 104 137 0.4737 8
## 108 129 0.6410 8
## 111 121 0.5543 8
## 116 73 0.5000 8
## 118 141 0.6367 8
## 120 132 0.6471 8
## 121 120 0.5909 8
## 124 143 0.2595 8
## 125 125 0.5455 8
## 127 115 0.6444 8
## 128 138 0.4010 8
## 131 132 0.4067 8
## 145 128 0.5543 8
## 146 120 0.5556 8
## 149 121 0.5667 8
com.ceos <- induced.subgraph(network.social, V(network.social)$community ==
8, impl = "auto") # See the help page (?induced.subgraph)
summary(com.ceos)
## IGRAPH UNW- 63 526 --
## attr: name (v/c), lastName (v/c), status (v/c), degree_total
## (v/n), degree_in (v/n), degree_out (v/n), reach_2_step (v/n),
## transitivity_ratio (v/n), community (v/n), weight (e/n)
Again, unless you make extensive use of plot options, or have an incredibly large screen (or document page), the plotted graphs are barely readable:
g <- com.ceos
plot(g)
By default (when no layout is given), nodes are placed at random coordinates and labelled automatically with consecutive numbers starting at 0.
To keep the same node positions across plots, compute the layout once and store it, here in l:
l <- layout.random(g)
plot(g, layout = l)
What is a layout? Extracted from igraph help (?layout)
“A Layout is either a function or a numeric matrix. It specifies how the vertices will be placed on the plot. If it is a numeric matrix, then the matrix has to have one line for each vertex, specifying its coordinates. The matrix should have at least two columns, for the x and y coordinates, and it can also have third column, this will be the z coordinate for 3D plots and it is ignored for 2D plots. If a two column matrix is given for the 3D plotting function rglplot then the third column is assumed to be 1 for each vertex. If layout is a function, this function will be called with the graph as the single parameter to determine the actual coordinates. The function should return a matrix with two or three columns. For the 2D plots the third column is ignored”.
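To make the two forms described in the help text concrete, the sketch below builds a layout matrix by hand (one row per vertex, with x and y columns) and then wraps the same computation in a function. The helper name circle_coords and the toy 6-vertex ring are illustrative assumptions, not part of the Enron analysis, and the igraph calls only run if the package is installed.

```r
# Hypothetical helper: place n vertices evenly on the unit circle.
# Returns an n x 2 numeric matrix (column 1 = x, column 2 = y).
circle_coords <- function(n) {
  angle <- 2 * pi * (seq_len(n) - 1) / n
  cbind(x = cos(angle), y = sin(angle))
}

n <- 6
coords <- circle_coords(n)  # layout given as a numeric matrix

# Layout given as a function: it receives the graph and returns the matrix.
layout_fun <- function(graph) circle_coords(igraph::vcount(graph))

if (requireNamespace("igraph", quietly = TRUE)) {
  toy <- igraph::make_ring(n)     # toy graph, not the Enron network
  plot(toy, layout = coords)      # matrix form
  plot(toy, layout = layout_fun)  # function form, same picture
}
```

With the graph from this document, the matrix passed as layout would need one row per vertex of g, that is, vcount(g) rows.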
Let’s try to improve the position of the objects in the plane.
Let’s use the last name as the node label:
V(g)$label <- V(g)$lastName
plot(g, layout = layout.fruchterman.reingold, vertex.label.font = 1, vertex.label.cex = 0.8,
edge.arrow.size = 0.3, vertex.size = 12, vertex.color = "yellow")
plot(g, layout = layout.kamada.kawai)
# color the edges
par(bg = "#000000", mar = c(1, 1, 1, 1), oma = c(1, 1, 1, 1))
edge_col <- colorpanel(length(table(E(g)$weight)), low = "#2C7BB6", high = "#FFFFBF")
E(g)$color <- edge_col[factor(E(g)$weight)]
plot(g, main = "enron", layout = layout.fruchterman.reingold(g, params = list(niter = 1000,
weights = E(g)$weight)), vertex.label = V(g)$label, vertex.size = log10(as.numeric(V(g)$degree_total)),
vertex.label.font = 1, vertex.label.color = "white", vertex.label.cex = 0.8,
edge.arrow.size = 0.3, vertex.color = "yellow", edge.width = 1.5 * log10(E(g)$weight),
edge.curved = TRUE, edge.color = E(g)$color)
plot(g, main = "enron", layout = layout.kamada.kawai(g, params = list(niter = 1000,
weights = E(g)$weight)), vertex.label = V(g)$label, vertex.size = 12, vertex.label.font = 1,
vertex.label.color = "black", vertex.label.cex = 0.8, edge.arrow.size = 0.3,
vertex.color = "yellow", edge.width = 1.5 * log10(E(g)$weight), edge.curved = TRUE,
edge.color = E(g)$color)
# Other layouts
par(bg = "#FFFFFF", mar = c(1, 1, 1, 1), oma = c(1, 1, 1, 1))
plot(g, main = "enron", layout = layout.reingold.tilford, vertex.label = V(g)$label)
Other layouts: layout.reingold.tilford (above) produces a hierarchical, tree-like graph; layout.lgl (below) is the Large Graph Layout generator.
plot(g, main = "enron", layout = layout.lgl, vertex.label = V(g)$label)
layout.circle produces chord plots:
l <- layout.circle(g)
# use colour functions
par(bg = "#000000", mar = c(1, 1, 1, 1), oma = c(1, 1, 1, 1))
edge_col <- colorpanel(length(table(E(g)$weight)), low = "#2C7BB6", high = "#FFFFBF")
E(g)$color <- edge_col[factor(E(g)$weight)]
plot(g, layout = l, vertex.label = V(g)$label, vertex.size = 1, vertex.label.color = "white",
edge.width = 1.5 * log10(E(g)$weight), edge.curved = FALSE, edge.color = E(g)$color)
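The edge colouring used in these plots assigns one step of a colour gradient to each distinct weight and then looks up each edge's colour through a factor. A minimal base-R sketch of that idea follows. It assumes a made-up weight vector in place of E(g)$weight and uses colorRampPalette as a stand-in for gplots::colorpanel. The + 1 inside the logarithm is a variation introduced here, not what the code in this document does.

```r
# Made-up edge weights standing in for E(g)$weight.
w <- c(1, 3, 3, 10, 1, 25)

# One colour per distinct weight, from blue (low) to pale yellow (high).
# colorRampPalette is a base-R stand-in for gplots::colorpanel.
n_levels <- length(unique(w))
palette_w <- colorRampPalette(c("#2C7BB6", "#FFFFBF"))(n_levels)

# factor(w) sorts its levels by increasing weight, so indexing the palette
# with the factor's integer codes gives lighter colours to heavier edges.
edge_colour <- palette_w[as.integer(factor(w))]

# Line widths on a log scale. The + 1 keeps weight-1 edges visible,
# because log10(1) is 0 and would give a zero-width line.
edge_width <- 1.5 * log10(w + 1)
```

Edges with equal weight share a colour, and the gradient has as many steps as there are distinct weights, which is why the palette length is taken from the number of unique values rather than the number of edges.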
rglplot is an experimental function to draw graphs in 3D using OpenGL (it requires the rgl package). This cannot be shown in a document, so only the code is presented. If you run the code from within an R or RStudio session, a new window will appear and you will find the graph there.
rglplot(g, layout = layout.sphere)
The same applies to tkplot: if you run the code from within an R or RStudio session, a new window will appear and you will find the graph there.
tkplot does interactive 2D plotting using the tcltk package. It can only handle graphs of moderate size; a thousand vertices is probably already too many. Some parameters of the plotted graph can be changed interactively after issuing the tkplot command: the position, color and size of the vertices and the color and width of the edges. See ?tkplot for details.
tkplot(g)